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Abstract 

Markov  chain  Monte  Carlo  (MCMC)  methods  are  currently  enjoying  a  surge  of  interest  within 
the  statistical  community.  The  goal  of  this  work  is  to  formalize  and  support  two  distinct  adap¬ 
tive  strategies  which  typically  accelerate  the  convergence  of  a  MCMC  algorithm.  One  approach  is 
through  resampling;  the  other  incorporates  adaptive  switching  of  the  transition  kernel.  Support 
is  both  by  analytic  arguments  and  simulation  study.  Application  is  envisioned  in  low  dimen¬ 
sional  but  non-trivial  problems.  Two  pathological  illustrations  are  presented.  Connections  with 
reparametrization  are  discussed  as  well  as  possible  difficulties  with  infinitely  often  adaptation. 


Key  words:  Adaptive  chains;  Gibbs  sampler;  X1-convergence;  Markov  chain  Monte  Carlo; 
Metropolis-Hastings  algorithm;  Rejection  method;  Resampling. 


1  Aeoesslon  For 

irris  GRA&I 
DTIC  TAB 
Unanr>.oT*uced 
JustUloati 

SK 

□ 

□ 

on _ _ 

- , 

Bv  J 

button/..  ! 

Availability, fteaoe 

Hat 

fc' 

Avert! 

Spec 

and/or 

ial 

1  Introduction 


Markov  chain  Monte  Carlo  (henceforth  MCMC)  algorithms  are  currently  experiencing  a  surge 
of  interest  within  the  statistical  community.  Though  they  first  appeared  in  the  literature  forty  years 
ago  (Metropolis  et  a l.  1953),  the  present  enthusiasm  arises  from  their  much  more  recent  recognition 
as  powerful  tools  for  implementing  a  wide  range  of  statistical  inference.  For  instance  in  Bayesian 
inference,  MCMC  simulation  enables  calculation  of  features  of  the  posterior  distribution  of  model 
unknowns  given  the  observations  (Gelfand  and  Smith  1990).  For  maximum  likelihood  estimation, 
where  the  likelihood  is  specified  as  a  non-norm  alized  joint  density  for  the  observations,  MCMC 
techniques  can  be  used  for  carrying  out  the  maximization  (Geyer  and  Thompson  1992;  Gelfand 
and  Carlin  1993). 

In  either  case  the  object  we  wish  to  learn  more  about  is  a  function,  say  /( u),  which  we  assume 
is  strictly  positive  and  integrable  with  respect  to  a  measure  /x  over  a  set  of  interest  denoted  by  U. 
In  practice  ft  is  a  product  of  Lebesgue  and/  or  counting  measures.  Utilizing  the  duality  between 
population  and  sample  we  propose  to  learn  about  /  by  drawing  samples  from  the  normalized 
version  of  /,  the  density  A(u)  =  /(u)//w  /(u)dft(u).  MCMC  techniques  permit  handling  of  high 
dimensional  u  but  there  is  also  considerable  interest  in  cases  where  dimension  is  not  so  large  but 
asymptotic  approximations  are  inappropriate  and  analytic  methods  are  infeasible  (see  Gelfand, 
Smith  and  Lee  1992  for  a  broad  range  of  examples).  This  is  the  situation  we  consider  with  the 
objective  being  to  propose  and  justify  modifications  to  MCMC  techniques  which  tend  to  hasten 
convergence. 

A  few  words  regarding  convergence  of  MCMC  techniques  are  appropriate.  There  is  by  now 
a  substantial  theoretical  literature  providing  conditions  for  and  rates  of  convergence  of  MCMC 
algorithms.  We  mention  a  few:  Geman  and  Geman  (1984);  Tierney  (1994);  Applegate  et  al.  (1991); 
Roberts  and  Smith  (1992);  Schervish  and  Carlin  (1992).  In  application,  convergence  conditions  are 
readily  checked  but  theoretical  rates,  usually  in  the  form  of  bounds,  seem  to  be  of  little  practical 
value.  As  a  result  there  has  also  arisen  a  substantial  discussion  of  convergence  diagnostics,  using  the 
output  stream  of  the  MCMC  algorithm,  e.g.,  Ritter  and  Tanner  (1992),  Raftery  and  Lewis  (1992), 
Zellner  and  Min  (1992),  Gelman  and  Rubin  (1992)  and  Geyer  (1992).  Since  pathological  choices 
of  /( u)  can  either  deceive  or  render  inapplicable  such  diagnostics,  a  cynical  viewpoint  suggests 
that  we  can  only  hope  to  demonstrate  lack  of  convergence  rather  than  convergence  (Clifford  1993). 
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However,  in  practice,  a  broad  range  of  starting  values  and  a  bit  tuning  of  tbe  MCMC  algorithm 
usually  enable  the  user  to  feel  comfortable  with  regard  to  convergence. 

In  the  present  work  we  do  not  consider  the  problem  of  assessing  convergence.  Rather  our 
goal  is  to  formalize  and  support  two  distinct  adaptive  strategies  for  accelerating  the  convergence 
of  a  MCMC  algorithm  for  low  dimensional  but  non-trivial  problems.  Practically  useful  adaptive 
accelerators  must  be  built  solely  from  the  iterations  of  the  MCMC  algorithm  and  must  not  be  too 
costly  to  implement.  This  is  the  case  with  our  proposals  though  we  anticipate  refinement.  Support 
is  provided  both  through  theory  and  simulation.  Related  work  of  Gilks  et  al.  (1992)  describes 
a  technique  called  adaptive  direction  sampling  which,  though  not  discussed  here,  falls  under  the 
umbrella  of  our  second  class  of  strategies.  We  confess  at  the  outset  that,  in  using  adaptive  strategies, 
we  not  guarantee  faster  convergence.  Occasionally  we  may  be  led  in  the  “wrong  direction". 
However,  as  we  argue  in  the  sequel,  the  proposed  adaptation  never  compromises  convergence  itself. 

It  is  well  known  that  all  numerical  integration  approaches  perform  better  when  correlation 
amongst  the  components  of  u  is  low.  In  particular,  MCMC  algorithms  converge  more  rapidly. 
Hence  acceleration  might  be  attempted  through  a  reparametrization  to  approximate  orthogonal¬ 
ity.  An  effective  linear  transformation  to  achieve  roughly  uncorrelated  components  of  u  might  be 
identified  through  analytic  investigation  of  /( u)  or  adaptively  from  early  output  of  the  MCMC 
algorithm  (see  Muller  1994  in  this  regard).  Such  orthogonalization  need  not  always  be  worthwhile. 
For  instance,  with  a  Gibbs  sampler,  conjugacies,  which  permit  convenient  draws  from  complete  con¬ 
ditional  distributions,  may  be  sacrificed.  In  any  event,  such  reparametrization  is  not  in  competition 
with  our  approaches,  rather  it  may  be  employed  in  conjunction  with  them. 

The  format  of  the  paper  is  as  follows.  Section  two  provides  a  brief  review  of  aspects  of  MCMC 
simulation  which  we  require,  as  well  as  a  description  of  the  two  acceleration  strategies.  The  first  is 
based  upon  resampling  and  support  is  provided  in  Section  3.  The  second  is  based  upon  adaptive 
modification  of  the  transition  kernel  and  support  is  presented  in  Section  4.  Section  5  provides 
two  useful  examples.  The  first  fits  a  nonlinear  model  to  a  real  data  set  resulting  in  a  very  poorly 
behaved  likelihood.  The  second  reveals  that  adaptation  done  infinitely  often  can  modify  the  ergodic 
behavior  of  the  MCMC  algorithm. 
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2  MCMC  Algorithms  and  Accelerators 


MCMC  algorithms  proceed  from  the,  possibly  surprising,  idea  of  creating  a  stationary  Markov 
chain  whose  invariant  distribution  is  S  (here  3  is  the  probability  measure  associated  with  the 
density  h{ u),  i.e.,  h  =  ).  We  work  with  two  such  algorithms  here,  the  Gibbs  sampler  (see,  e.g., 

Gcman  and  Geman  1984;  Gelfand  and  Smith  1990)  and  the  Metropolis-Hastings  algorithm  (see, 
e.g.,  Hastings  1970;  Tierney  1994). 

2.1  Notation  and  Review 

We  introduce  a  bit  of  notation.  For  a  discrete  time  stationary  Markov  chain  we  denote  its  state  at 
time  t  by  and  its  transition  kernel  by  P,  i.e.,  for  a  /i- measurable  set  A,  P(u(t-X);  A)  = 

P(u(0  €  Alu^4-1)).  The  distribution  H  is  an  invariant  distribution  for  P  if  for  all  ^—measurable  sets 
A,  3(A)  =  /  P(u;  A)dB(u).  If  P(‘)(u;  A)  =  P(uW  6  A|u(°)  =  u)  then  the  invariant  distribution 
H  is  an  equilibrium  distribution  if 

lim  pW(U;  A)  =  IT( A)  (1) 

*—♦00 

for  all  /*— measurable  sets  A.  Expression  (1)  captures  the  basis  of  the  MCMC  simulation  approach: 
after  a  sufficient  number  of  transitions,  uW  is  approximately  distributed  according  to  H.  All 
discussion  of  convergence  diagnosis  concerns  itself  with  assessing  when  t  is  large  enough  so  that  the 
"distance”  between  pM,  which  can  rarely  be  computed  analytically,  and  3  is  small. 

We  can  also  work  with  the  marginal  distribution  of  uW,  say  i.e.,  BW(A)  =  J  pW(u;  A) 
dp(°)(u)  where  3^  is  the  distribution  of  starting  states.  If  (1)  holds  then  limt_oo  .JJW(A)  =  3(A) 
for  all  n— measurable  sets  A.  Of  course  JfW  also  will  not  be  available  explicitly  but  3^  lends  itself 
to  Monte  Carlo  integration  through  the  relationship 

tfW(A)  =  J  P(u;A)dS^~1\u).  (2) 

The  Gibbs  sampler  creates  a  transition  horn  u^-1^  to  as  follows.  If  we  partition  u 
into  k  blocks,  i.e.,  u  =  (ui,  U2,  •  •  • ,  Ufc),  we  obtain  as  a  draw  from  the  conditional  density, 
h(ui|uj-1^,u^<-1\-"»ufc-1^)-  We  then  obtain  as  a  draw  from  A(u2|u^,  U3 -1\-  •  • , Uj.4”1^), 
etc.  until  finally  is  drawn  from  fi(u*|u^,  u^,  •  •  • ,  u^2x)  whence  u  has  been  fully  updated. 


3 


(3) 


Thus,  the  transition  kernel  of  the  Gibbs  sampler  has  a  density  p(u;  v)  taking  the  form 

k 

p(u;  v)  =  JJ  fc(vt|vi,  •  •  • ,  Vi_a,  iu+1,  •  •  - ,  ufc). 

i=i 

Note  that  all  of  the  conditional  densities  on  the  right  hand  side  of  (3)  are  proportional  to  /  evaluated 
at  the  corresponding  arguments.  If  any  of  these  densities  are  not  standard,  normalization  of  /  is 
needed  in  order  that  p(u;v)  can  be  computed.  Therefore,  for  the  Gibbs  sampler,  (2)  can  be 
differentiated  yielding  the  marginal  density  of 

h^(u)  =  J  p(z;  u)h(‘-1)(z)  dti(z).  (4) 

The  general  Metropolis-Hastings  algorithm  updates  ul1-1)  to  u(*)  as  follows.  Suppose  Q  is  a 
Markov  transition  kernel  having  strictly  positive  density  q(  u;  v)  with  respect  to  ft,  i.e.  <?KA)  = 

Sa  ?(«;  v)Mv)-  Let  «(u; v)  =  min  (i>  /(djfevj)  “d  defiae  r(“)  =  i  -  Su  <*(“;  v)?(«;  v)Mv)- A 
transition  from  to  uW  is  made  by  drawing  a  candidate  transition  state  v  from  v). 

Then,  with  probability  v)  we  move  from  to  v  and  take  =  v;  with  probability 

1  —  a(u^t_1^ ;  v)  we  stay  at  and  take  Thus  r(u(t-1))  is  the  chance  of  not  moving 

and  the  transition  kernel  takes  the  form 

P(u;A)  =  /  a(u;v)?(u;v)d#*(v)  +  r(u)tfu(A)  (5) 

J  A 

where  6U  denotes  the  degenerate  distribution  at  u,  i.e,  S-a(A)  =  1  if  u  6  4,  0  if  u  £  A.  Thus 
P(u;  A)  arises  as  a  mixed  measure.  Returning  to  (2),  in  the  Metropolis-Hastings  case,  though  pW 
is  atomic,  is  absolutely  continuous  with  respect  to  n  if  the  starting  distribution  is.  In 
fact,  using  (2)  and  (5),  direct  calculation  of  the  Rndon-Nikodym  derivative  yields 

*<0.4*2  =  .«-»+*«-»  (6) 

an 

where  s(t-1)(u)  =  /  a(z;  u)q(z;  u)fc(t-1)(z)<fyi(z). 

2.2  Acceleration  Approaches 

The  term  “acceleration”  is  used  in  numerical  analysis  to  indicate  hastening  of  convergence.  Anal¬ 
ogously,  for  an  MCMC  algorithm,  an  acceleration  approach  is  a  technique  to  diminish  the  number 
of  transitions,  t,  such  that  is  approximately  distributed  according  to  h.  We  consider  two  types 
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of  accelerators.  We  present  them  in  the  setting  of  say  m  parallel  chains  whence  at  any  iteration 
t  we  have  m  i.i.d  observations.  The  approaches  are  more  easily  described  in  this  case  and  the 
independence  simplifies  analytic  arguments.  However,  several  authors  advocate  the  use  of  a  few, 
perhaps  even  a  single  chain  as  most  sensible  and  efficient  (see,  e.g.,  Geyer  1992).  Intuitively,  we 
would  expect  acceleration  to  ensue  even  if  we  used  rrt  iterates  from  a  single  chain  possibly  after  an 
initial  bum-in  and  possibly  with  spacing  to  reduce  dependence. 


2.3  Acceleration  through  Resampling 


Suppose,  then,  that  we  have  u ^\j  =  1,2, i.i.d  for  each  t.  The  first  type  of  accelerator 
attempts  to  convert  the  sample  from  hW  to  a  sample  approximately  from  h.  The  idea  is  that  if 
our  current  marginal  distribution  is  moved  “closer”  to  k,  within  fewer  transitions  our  draws  will  be 
essentially  from  h.  Indeed,  if  our  current  were  drawn  exactly  from  h,  then  all  subsequent  draws 
would  be  as  well.  Such  conversion  can  be  accomplished  using  ideas  in  Smith  and  Gelfand  (1992) 
who  suggest  two  resampling  methods,  the  rejection  method  (see  also  Devroye  1986  and  Ripley 
1987)  and  the  weighted  bootstrap  (see  also  Rubin  1988). 


Applied  to  our  setting  the  rejection  method  proceeds  as  follows.  Compute  M  =  supu 
For  each  draw  Zj  ~  U(0, 1).  If  Zj  <  retain  u^  as  a  draw  from  h.  The  weighted 

bootstrap  computes,  for  each  u^,  the  weights  vtj  =  and  g,-  =  ^mJ  v  .  The  axe  then 

resampled  acording  to  the  probabilities  qj.  A  resampled  u*  is  approximately  distributed  according 
to  h.  The  rejection  method  is  desirable  in  that  retained  observations  have  exactly  the  distribution 
h.  However,  computation  of  M  is  usually  difficult  and  only  a  portion  (often  small)  of  the  m 
are  retained.  Hence,  in  practice  and  in  our  simulation  investigation  we  use  the  weighted  bootstrap. 
It  is  easy  to  implement  and  permits  as  many  resampled  observations  as  desired.  However,  their 
distribution  is  only  approximately  h. 


Unfortunately,  except  in  the  simplest  cases,  hi*)  will  not  be  known,  so,  in  implementing  a 
resampling  method,  we  must  replace  fcW  by  an  estimate  fcM.  This  extra  level  of  approximation 
should  not  be  viewed  as  a  deterrent.  Our  goal  is  only  to  draw  from  a  distribution  which  we  expect 
to  be  “closer”  to  h  than  fcW.  In.  fact,  denoting  a  resampled  draw  by  u*(‘),  we  would  make  the  next 
transition  via  P  to  obtain  u(t+1).  Indeed,  if  u*  ~  h,  then  so  would  all  subsequent  draws  under  this 
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chain. 

How  shall  we  estimate  &(*)?  One  possibility  is  a  kernel  density  estimate  based  upon  the 
(see,  e.g.,  Silverman  1986).  An  alternative  arises  using  Monte  Carlo  integration.  In  particular, 
corresponding  to  (4),  we  obtain 

A<‘)(u)  =  u)  (7) 

i= i 

where  p(.; .)  is  defined  in  (3).  Monte  Carlo  estimation  of  h  is  more  computationally  demanding 
than  a  kernel  density  estimate.  However,  the  Hao-Blackwellization  encompassed  in  (7)  suggests, 
under  a  wide  range  of  loss  functions,  a  better  estimate  of  h  will  result  (see  G  elf  and  and  Smith 
1990  in  this  regard).  The  form  (7)  was  used  in  the  simulation  investigation  of  the  next  section. 
Corresponding  to  (6),  if  we  assume  hW  as  then  A  Monte  Carlo  estimate  of 

is  straightforward  again  using  the  For  a  given  u,  we  can  obtain  a  Monte  Carlo  estimate  of 

r(u)  by  making  draws  v,-  given  u  from  $(u;  v). 

Note  that  the  proposed  adaptive  modification  of  the  MCMC  algorithm  changes  the  distribution 
at  the  t**1  iteration  from  hW  to  a  random  distribution  h  and  results  in  a  non-Markovian  transition. 
Were  we  to  do  this  at  each  iteration  there  is  no  assurance  that  the  resulting  MCMC  algorithm  con¬ 
verges  or  that  it  converges  to  h  (see  Section  5.2).  However,  we  envision  such  adaptive  modification 
for  only  a  few  iterations,  thereafter  running  a  stationary  chain  whose  invariant  distribution  is  h  so 
convergence  is  assured. 

2.4  Acceleration  by  Changing  P 

The  second  type  of  acceleration  replaces  the  current  transition  kernel  say  Pi  by  a  new  choice  Pj 
which  is  “better  than”  Pi  .  What  do  we  mean  by  “better  than”  ?  It  is  easier  to  work  with  a  finite 
state  space  (and,  of  course,  so  do  computing  machines)  whence  transition  kernels  become  stochastic 
matrices.  Two  such  matrices  having  the  same  unique  invariant  distribution  may  be  compared  using 
their  eigenvalues.  In  particular  if  P  is  r  x  r  having  eigenvalues  1  =  fa  >  02  >  •  •  •  >  0r,  let 
£(•>  =  max  (fa,  I  A-  I)  •  Then  /?(*>  can  be  used  to  bound  the  distance  between  pW  and  h.  This  is 
usually  referred  to  as  Perron- Frobenius  theory  and  a  weak  result  (see,  e.g.,  Iosifescu  1980)  is  the 
following.  Suppose  the  invariant  distribution  h  is  written  as  an  r  x  1  vector,  h,  and  we  define  the 
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r  x  r  matrix,  say  3  =  (h,  h,  —  h).  Noting  that  pW  is  in  fact  (P)‘,  we  have 

(p)*  -ht  =  o  (8) 

where  7  is  the  multiplicity  of  P^*\  Expression  (8)  is  not  as  useful  as  we  would  like  since  it  does  not 
provide  an  explicit  bound  on  a  cell  difference. 

A  stronger  result  is  available  if  P  is  reversible,  i.e.,  if  hrPrt  =  h,P„  for  all  r  and  s.  Diaconis  and 
Stroock  (1991)  obtain  a  bound  on  the  variation  distance  to  equilibrium  in  terms  of  /?(*).  Defining 
this  distance  for  probability  distributions  h  and  g  to  be  j|  h  -  g  ||=  Y*i  \hi  —  gi\/2  we  have 

Theorem  l(Diaconis  and  'Stroock):  If  P  is  a  reversible  Markov  chain  with  unique  invariant 
distribution  h  and  P  is  irreducible  then  for  all  j  and  t 

l)  <  (9) 

Thus  for  Pi,  t  =  1,2,  with  associated  p\*^  we  will  say  that  P2  is  better  than  P\  if  0^  <  Pi^- 

Again  considering  m  parallel  chains  each  taken  to  the  iteration,  the  proposal  is  to  utilize  the 
to  adaptively  revise  Pi  to  P2  which  we  expect  to  be  better  than  Pi.  For  a  Metropolis-Hastings 
algorithm  this  means  adaptive  modification  of  the  proposal  transition  kernel  q(u;  v).  Such  change 
results  in  a  chain  which  is  no  longer  stationary  and  hence  need  not  converge  to  h  (see  Section 
5.2).  As  at  the  end  of  Section  2.3,  we  propose  only  a  few  adaptive  transitions  before  settling 
into  a  stationary  chain,  again  to  insure  convergence.  Analytic  justification  and  simulation  support 
is  provided  in  Section  4.  The  idea  of  switching  between  different  transition  mechanisms  as  the 
simulation  proceeds  is  discussed  in  Besag  and  Green  (1993).  There,  the  issue  is  one  of  conflicting 
demands  upon  an  MCMC  algorithm,  namely  speed  of  convergence  vs.  small  variability  in  estimation 
using  ergodic  averages.  For  the  former  we  want  0^  small.  For  the  latter  they  show  that  small 
values  of  help,  i.e.,  negative  eigenvalues  help.  Our  proposed  switching  is  only  concerned  with 
the  first  demand. 

2.5  Additional  Remarks 

We  have  ignored  the  cost  in  computing  time  to  implement  the  implementation.  This  cost  is  clearly 
problem  specific  and  might  offset  the  benefit  of  acceleration.  In  our  simulation  investigation  this 
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was  not  the  case.  We  might  expect  resampling  to  be  more  effective  than  changing  P,  since  the 
explicit  objective  of  tte  former  is  to  sample  approximately  from  h  rather  than  to  improve  upon  the 
current  P.  Hov.ver,  so  many  factors  affect  the  convergence  of  an  adaptive  procedure  that  such  a 
conclusion  is  not  generally  supportable.  Both  approaches  are  rather  sensitive  to  the  choice  of  m. 
If  m  is  too  small  a  poor  estimate  of  h.M  or  a  poor  revision  of  P  may  arise  yielding  a  potentially 
adverse  effect  on  convergence. 


3  Acceleration  through  Resampling 


Here  we  farther  investigate  the  resampling  approach  of  Section  2.3.  Again  our  goal  is  to  resample 

from  the  set  of  u =  1, 2, . . . ,  m,  ii.d  a  new  sample  u*^,j  =  1, 2, . . . ,  m*  whose  distribution 

h  is  closer  to  h  than  h.W  is.  Recall  that,  since  hi*)  is  not  available,  an  estimate,  hi*)  must  be  used 

;(*) 

to  implement  the  resampling.  Hence  h  is  random. 

In  Section  3.1  we  argue  that  such  resampling  works  in  the  sense  that,  under  mild  conditions, 
the  variation  distance,  Jm  =||  h  —  h  ||— ►  0  o.s.  asm-*  oo.  The  practical  implication  is  that  for 
a  suitably  large  m,  resampling  may  be  expected  to  produce  draws  from  a  distribution  “closer"  to 
h.  In  Section  3.2  we  present  some  very  encouraging  simulation  results  regarding  the  distribution 
of  Jm,  in  comparison  with  the  constant  ||  h(*)  -  h  ||. 


3.1  Theoretical  Results 


Peca-lliTig  the  rejection  method  described  in  Section  2.3,  suppose  M  =  supu  *  ~  U(0, 1)  but 
u  ~  pi(u).  Suppose  further  that  we  keep  u  if  z  <  iJ^fey  Then  the  density  of  u  is 


/(u)gi(u)  .  f  /(tt)gx(tt), 

S2(u)  '  J  5a(“) 


(10) 


The  proof  is  the  same  as  that  of  the  usual  rejection  method  (see,  e.g.,  Ripley  1987).  Similarly,  for  the 
weighted  bootstrap  of  Section  2.3,  if  we  sample  u j  i.i.d  pi(u),  j  =  1, 2, . . . ,  m  but  we  resample  using 
weights  vi j  —  /(uj)/<7a(uj)  it  is  straight  forward  to  show  that  the  distribution  we  are  approximately 
sampling  is  again  (10).  In  our  setting  g\  is  hi*),  gt  is  hi*)  whence 

j-(0  _  /(u)JjM(u)  f  /(u)h(O(u) 

*  [u)  'J 


M*)(  u) 


W‘)(u) 


-du. 


(ID 


8 


We  set  t  —  1,  without  loss  of  generality.  We  assume  that  hi1)  converges  a.s.  to  hi1).  This  is  the 

case  with  usual  kernel  density  estimates  (see  Devroye  and  Gyorfi  1985)  as  well  as  estimator  (7).  In 

either  case,  hi1)  is  of  the  form  m-1  g,( u)  where  gj  is  a  (random)  density.  Hence  the  numerator 

of  (11)  converges  a.s.  to  /( u)  as  m  — »  oo.  If,  as  m  — *  oo,  the  denominator  converges  to  / /( u),  the 
:(1) 

density  h  converges  a.s.  to  the  density  h.  Then  by  a  standard  theorem  (see,  e.g.,  Glick  1974) 
j(i) 

f\h  —  h|  — *■  0  a.s.  as  m  — ►  oo. 

Thus  we  investigate  the  limiting  behavior  of  J  fh^1)/ hi1).  We  write  U1)  as  to  :ndicate 
that  it  is  an  average  over  m  terms.  In  fact  the  i.i.d  sequence  {u fP\j  =  1, 2, . . .}  drawn  from  hl°) 
determines  {hm^,m  =  1,2,...}  under  (7).  The  ia.d.  sequence  {u^,j  =  1,2,...}  drawn  from  h l1) 
determines  m  =  1, 2, . . .}  under  a  kernel  density  estimate.  In  either  case,  it  is  clear  that  for 
certain  sequences  the  random  functions  of  u,  h(1)(u)/h^^(u),  can  be  badly  behaved  in  the  tails 
and  that,  with  fm  =  fh^/hJ^,  f  fm  need  not  exist. 

Fortunately  in  practice,  the  situation  is  more  encouraging.  For  the  Gibbs  sampler,  suppose  /  is 
continuous  and  strictly  positive,  as  it  will  be  in  most  statistical  applications  where  u  is  a  parameter 
vector.  Since  the  complete  conditional  distributions  associated  with  /  are  all  proportional  to  /, 
from  (3),  the  p(u^t-1^;  u)  are  continuous  and  strictly  positive.  Hence,  from  (7),  hi*)  and  thus  fm  axe 
as  well.  Using  a  kernel  density  estimate  for  hi1)  based  upon  a  kernel  function  which  is  continuous 
and  strictly  positive  where  /  is,  again,  fm  will  be  as  well. 

Fix  an  underlying  sequence  such  that  /m(u)  converges  to  /( u).  By  EgorofPs  theorem  /m(u) 
converges  uniformly  to  /( u)  on  a  compact  subset  of  U,  say  ft,  having  arbitrarily  large  probability 
under  /.  But  then,  since  fm  is  continuous  on  Cl  we  have  /n  fm  — *•  fn  f.  Since  fm  — ♦  /  a.s. 
we  can  claim  such  convergence  of  integrals  for  almost  every  underlying  sequence.  In  the  case  of  a 
Metropolis- Hastings  algorithm,  to  obtain  similar  results  we  require  that,  for  almost  every  v,  q(v;  u) 
to  be  strictly  positive  and  continuous  where  /  is. 

3.2  Simulation  Results 


The  simulation  study  investigates  the  distribution  of  Jm'/2  =  /  |h  -  h\  with  h  as  in  (11),  as 
well  as  the  exact  value  of  J/2  =  /l^1)  -  h\.  We  confine  ourselves  to  the  Gibbs  sampler  and  to 
illustrations  which  are,  of  necessity,  simple  in  order  to  permit  exact  calculations  and  thousands  of 
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replications  within  a  reasonable  amount  of  computing  time.  In  fact  w»  take  /( u)  to  be  bivariate 
normal  or  a  bimodal  mixture  of  two  bivariate  normals  with  high  correlation  between  the  compo¬ 
nents,  a  situation  where  convergence  is  known  to  be  slow.  In  either  case  we  ran  the  sampler  drawing 
from  first  taking  to  be  N(hq,<t$).  We  chose  n 0  away  from  the  mean  of  f(u 2)  and 

considered  a\  both  smaller  and  larger  than  the  variance  of  f(u2)  in  order  to  examine  under  and 
overdispersed  initial  distributions  relative  to  /(u2)-  la  the  mixture  case  fi(1)(u1)  is  not  available 
explicitly.  Straightforward  calculation  yields 


jS>/2  =  //(* i)It 


h^\u  i) 


-  l|dtii. 


(12) 


Hence,  given  hW(«i)  and  h^(ui),  (12)  can  be  calculated  using  Monte  Carlo  integration  by 
taking  draws  from  /( tti);  2000  draws  were  used.  The  exact  value  of  f  can  also  be  calculated 

in  this  fashion.  If  fi(1)(ui)  is  not  available  explicitly  it  was  obtained  by  Monte  Carlo  integration 
using  draws  from  h(°)(u2).  500  draws  were  made  to  evaluate  the  integral  in  the  denominator  of 
right  hand  side  of  equation  (12). 

For  each  /  and  h(°\  the  simulation  involves  an  outer  loop  over  the  number  of  replicates  (we 
choose  1000  here).  Each  replicate  yields  a  random  value  of  Jm)  12.  For  each  replicate  we  ran  m  =  500 
parallel  strings  of  the  Gibbs  sampler.  Of  course  since  we  only  took  one  iteration  this  amounts  to 
making  500  draws  from  hi0)  after  which  h^^ui)  is  determined  and  (12)  can  be  calculated. 

For  the  single  bivariate  normal  case  we  took  /  =  h  —  HVJV(0, 0,1, 1,0.8)  and  hl°)( v?)  = 
N(2,cr$),  0$  ^  9,0.09.  For  the  mixture  case  our  h  =  0.4  x  HVJV(2.0,  -2.0,1, 1,0.8)  +  0.6  x 
BVN( 0,0, 1,1, 0.8)  with  hW(ua)  =  JV"(0, 1).  The  exact  value  of  7  and  some  features  of  the  dis¬ 
tribution  of  7m  ^  are  summarized  in  Table  1.  Note  that  the  starting  distribution  does  not  affect 
P(  7m^)  <  7)  which  equals  1  in  all  the  cases  indicating  the  effectiveness  of  the  the  adaptive  strategy. 
The  distribution  of  the  7m  ^  is  fairly  symmetric  and  lies  well  below  the  7  in  each  case.  Fixing  m, 
in  general,  performance  will  depend  upon  how  often  and  at  which  iterates  we  resample  as-  well  as 
the  dimension  of  u. 


{  Insert  Table  1  here  } 
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4  Acceleration  by  Changing  P 


Here  we  further  investigate  the  approach  of  adaptive  switching  of  transition  kernels  introduced 
in  Section  2.4.  The  goal  is  to  replace  the  current  kernel  Pi  by  a  new  one,  P2  which  we  expect  to 
accelerate  convergence.  We  envision  application  of  this  approach  to  the  Metropolis-Hastings  class  of 
algorithms  by  changing  ?(u;  v)  the  proposed  kernel.  As  noted  by  Tierney  (1994)  various  types  of  q’s 
are  useful.  For  example  if  v  is  generated  by  adding  a  random  increment  to  u,  drawn  according  to  a 
density  g  then  g(u;  v)  =  g(v  -  u).  If  g  is  elliptically  symmetric,  i.e.,  g(z)  =  g(zTAz)  then  switching 
might  mean  changing  A.  If  v  is  chosen  independently  of  the  current  u  then  g(u;  v)  =  p(v).  Here  g 
functions  similarly  to  an  importance  sampling  density.  Transition  depends  upon  the  relative  sizes 
of  the  weights  to(u)  =  and  w(v)  =  If  g  «  /  then  the  chain  produces  i.i.d  draws  from 
h.  Under  suitable  rescaling  of  u,  g  is  often  taken  to  be  a  multivariate  normal  or  t  density  so  that 
switching  would  change  the  mean  and  /or  covariance  of  g  to  make  g  more  resemble  /. 

We  remind  the  reader  that,  for  the  Metropolis-Hastings  algorithm,  any  such  g,  obtained  adap¬ 
tively  or  otherwise,  yields  a  chain  whose  invariant  distribution  is  h.  Hence  initial  adaptive  switching 
for  a  finite  number  of  transitions  does  not  affect  desired  ergodic  behavior.  In  Section  4.1  we  argue 
that,  in  finite  state  spaces,  for  given  choices  of  Pi  and  P3,  if  P3  is  better  than  Pi  in  the  sense  of 
Section  2.4,  then  for  a  fixed  total  number  of  transitions,  a  smaller  bound  on  variation  distance 
results  by  running  some  under  P2  than  all  under  Pi.  Here  we  extend  Theorem  1  of  Section  2.4. 
Of  course,  it  would  have  been  preferable  to  run  all  transitions  under  P2  but  a  potentially  better 
P3  is  only  identified  through  (parallel)  transitions  under  Pi.  The  fact  that  P3  is  thus  random  does 
not  violate  the  conclusions  of  Theorem  2  below.  However,  simulation  investigation  is  required  to 
demonstrate  that  such  adaptive  development  of  P3  does,  in  fact,  tend  to  yield  better  P’s.  This 
is  taken  up  in  Section  4.2  with  illustrations  using  a  random  increment  choice  for  g.  We  find  that 
adaptive  change  of  A  results  in  a  distribution  of  variation  distance  from  h  which  tends  to  yield 
smaller  values  than  the  fixed  variation  distance  without  switching.  The  work  of  Gilks  et  al.  (1992) 
incorporates  a  different  sort  of  adaptive  strategy  to  identify  a  potentially  better  Pi¬ 
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4.1  Theoretical  Results 

We  consider  the  rate  of  convergence  of  a  finite  state  space  Markov  chain  arising  under  a  finite 
deterministic  sequence  of  transition  matrices  Pi,  each  having  h  as  an  invariant  distribution.  It  is 
routine  that  such  a  chain  has  h  as  its  stationary  distribution.  We  illustrate  in  the  case  of  just  two 
Pi  where  Pi  is  run  for  the  first  s  iterations  and  then  P2  for  the  next  t  —  s  iterations.  We  obtain  a 
bound  on  the  variation  distance  at  the  t?*1  iteration  as  in  Theorem  1  which  may  be  compared  with 
(9).  The  reader  will  readily  see  extension  to  more  than  two  Pi’s.  Recalling  the  notation  surrounding 
Theorem  1,  we  have  for  all  j  and  1  <  s  <  t, 

Theorem  2: 

(D(jW>*-m)  <  (i3) 

where  (Ptf-)*  =  l- 

Proof:  Following  Diaconis  and  Stroock  we  have 

4  (£  {p;p‘r)]k  - 1. 

But  hj(Pf)n  =  hi(I^)ij,i  =  1,2  and  for  any  integer  k  >  1.  Hence  it  is  easy  to  see  that 

WP'Pf'h  =  h(P2~‘P')u>  “that 

(jyjrOa  (p-^-pn.j 

k  hi 

Let  D  be  a  diagonal  matrix  with  j**1  diagonal  entry  y/Kj.  Let  DPiD~x  =  TiBiTj  where  P»  is 
diagonal  with  eigenvalues  of  Pi  and  TiTf  =  I.  Therefore, 

(p;i f '->J?  )  =  £  £  «*%«*  (14) 

”  l  k 

where  a ji  =  (Tx-BiTf  ),•/  and  bit  =  (I^PalT)**- 

Now,  if  (Bi)x*  =  1,  =  Ti +  (/3i  ^)*  Hence 

if/#  7  <  4  Ml-  (^•))*)Ti^r1/,  =  (1  -  while  if  /  =  j  a.jj  <  =  T*Jm  + 

(M’Va-rU)  =  (l-tfh'fc+tfP)'-  Similarly  if  A  #  /  %  <  ^  while 

ifk  =  l  bu  <  b'u  =  (1-  Hence  (14)  is  bounded  by  Sfc  ajib'ikajk  which 

after  a  bit  of  algebraic  manipulation  becomes  (l  -  (/3i*Y*(^2*Y^~^)  /M0®i*Y*(/?2*Y^_*^  Thus 
the  bound  in  (13)  follows. 
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Comparing  theorems  1  and  2,  if  <  @[*\  a  tighter  bound  on  the  variation  distance  arises  by 
switching.  In  this  sense  we  argue  that  switching  accelerates  convergence. 

4.2  Simulation  Results 

The  simulation  study  is  patterned  after  that  of  Section  3.2  taking  u  to  be  three  dimensional,  high 
enough  to  allow  interesting  structure  while  again  keeping  computations  manageable.  We  assume 
that  /  (=  h,  here)  is  either  trivariate  normal  or  a  bimodal  mixture  of  two  trivariate  normals.  We 
utilize  a  random  increment  trivariate  normal  proposal  transition  kernel  with  covariance  matrix 
identity.  We  examine  the  benefit  of  adaptive  switching  in  the  following  illustrative  way.  We  first 
run  m  parallel  strings  each  for  t  Metropolis  steps  obtaining  u =  1, 2, . . . ,  m.  Secondly,  using  the 
output  of  the  first  s  Metropolis  steps  u =  1, 2, . . . ,  m,  we  co*r  i  ie  the  sample  covariance 
matrix  of  the  .  We  then  switch  the  covariance  matrix  of  the  proposal  transition  kernel  to  this 
£u  and  run  an  additional  t  —  s  Metropolis  steps  to  obtain  =  1, 2, . . . ,  m.  We  compare  the 

Ll  distance  between  /  and  the  density  fcW  of  the  with  that  between  /  and  the  density  of 
i.e.,  jW  =  /  —  /j  and  jW(En)  =  /  \h*W  —  f\.  Of  course  both  &W  and  h*M  are  unknown 

and  is  random,  JW  can  be  calculated  arbitrarily  accurately  by  making  m  very  large,  obtaining 
a  kernel  density  estimate  h.W  Qf  fcW  and  then  using  Monte  Carlo  integration  with  draws  from  / 
to  calculate  / 1  fcW  —  f)  =  J  /|^r  —  1|.  On  the  other  hand  since  En  varies  with  the  sample,  u^, 
we  treat  as  random.  That  is,  here  (as  in  practice)  we  do  not  take  m  so  large  that  En  is 

essentially  the  covariance  of  u^,  whence  the  new  transition  kernel  is  essentially  fixed.  Instead  our 
simulation  replicates  adaptive  switching  to  obtain  the  distribution  of  j(*)(£11)  and  to  compare  with 
/W.  The  JW(E u)  are  still  computed  by  Monte  Carlo  integration. 

Our  two  illustrative  cases  are  as  follows,  Case  I:  /( u)  =  h(u)  =  ./V3(/i,I  +  911T)  where  /*  = 
(— 5,5, 15)T  and  Case  II:  /( u)  =  h( u)  =  0.4  x  W3(0,I+  411T)  +  0.6  x  W3(  151,1  +  911T),  where  I 
is  the  identity  matrix  and  1  is  a  column  vector  of  ones.  In  either  case  our  starting  proposal  density 
is  a  unit  three  dimensional  normal  distribution.  As  before  we  ran  (m  =)  500  parallel  chains.  We 
have  taken  1000  draws  to  calculate  the  distances.  We  have  replicated  1000  /^(Eu)  to  evaluate 
the  probability  that  the  adaptive  distance  is  less  than  the  actual  distance.  For  Case  I,  we  take 
s  =  10,  f  =  30  while  for  Case  II  we  take  s  =  50,  t  =  110.  Table  2  summarizes  the  encouraging 
findings.  For  a  fixed  m,  in  general,  performance  is  sensitive  to  s,  t  and  the  d  snsion  of  u. 
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{  Insert  Table  2  here  } 


5  Two  Examples 

Here  we  consider  two  examples  employing  the  proposed  accelerators.  The  first  -*s  resam¬ 
pling  in  fitting  a  non  linear  model  with  a  very  badly  behaved  likelihood;  the  second  strates 

the  collapse  of  ergodic  behavior  under  infinitely  often  stochastic  adaptation  of  the  transition  kernel. 

5.1  A  Nonlinear  Model 

Bates  and  Watts  (1988  page  110-121)  discuss  the  analysis  of  a  data  set  on  the  utilization  of  nitrite 
in  bush  beans  as  a  function  of  light  intensity  on  each  of  two  days.  Assuming  zero  utilization  at 
zero  light  intensity  and  an  asymptote  as  light  intensity  increases,  they  investigate  several  nonlinear 
normal  models  with  homogeneous  variance  using  maximum  likelihood  analysis.  A  Michaelis-Menten 
form,  which  provided  adequate  fit,  models  the  mean  utilization  as  (0i  +  <t>x2)zi/(02  4-  xi  + 
where  x\  is  the  light  intensity  and  x2  =  0  for  day  1,  =  1  for  day  2.  The  likelihood  analysis  yielded 
the  summary  given  in  Table  3.  The  mean  square  error  yields  the  variance  estimate  a3  =  647723.0. 
The  correlation  matrix  indicates  a  very  poorly  behaved  likelihood  surface;  in  fact  as  we  shall  see, 
the  surface  is  more  pathological  than  the  authors  realized. 

{  Insert  Table  3  here  } 

We  perform  a  Bayesian  analysis  with  extremely  vague  priors  on  the  different  parameters,  so 
that,  up  to  standardization,  the  posterior  density  is  essentially  the  likelihood  surface.  We  show  the 
benefit  of  resampling  using  the  Gibbs  sampler.  The  complete  conditionals  for  0i,  <j>,  a2  are  standard. 
For  02  and  0$  we  used  Metropolis  subchains  for  50  steps  with  Gaussian  proposals  having  standard 
deviations  essentially  the  asymptotic  standard  errors.  We  eschewed  reparametrization  since  this 
would  sacrifice  easy  sampling  for  0\,<f>  and  a2.  Initially  we  started  ten  chains  in  the  vicinity  of  the 
MLE.  Autocorrelation  in  any  individual  chain  is  extremely  high,  requiring  roughly  1500  iterations 
to  die  down  to  insignificance.  Because  the  joint  posterior  for  0i,02  and  $3  is  so  nearly  singular, 
even  after  several  hundred  thousand  iterations  of  the  ten  chains  (requiring  more  than  24  hours 
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of  run  time),  we  cannot  capture  tail  behavior  adequately  or  conclude  convergence.  However,  as 
an  approximate  benchmark  we  obtained  kernel  density  estimates  for  the  6i  and  4>  based  upon  a 
sample  of  size  500,  50  from  each  chain  using  every  1500  th  iteration  after  100,000.  These  estimates 
appear  as  the  solid  curves  in  figure  1.  We  then  ran  500  parallel  chains  out  to  100  iterations  doing 
no  resampling  as  well  as  500  parallel  chains  out  to  100  iterations  resampling  once  at  iteration  5. 
The  kernel  density  estimates  for  the  and  <f>  without  resampling  appears  as  the  dashed  curves  in 
figure  1,  those  with  resampling  as  the  dotted  curves.  Cleariy,  even  a  single  resampling  is  beneficial. 
Lastly  note  that,  the  standard  errors  of  the  MLE’s  supplied  by  Bates  and  Watts  are  not  useful, 
e.g.,  §i  ±  3SE ^  includes  approximately  50  %  of  the  marginal  posterior  mass  for  0x. 

5.2  A  Pathological  Example 

We  remarked  in  Sections  2.3  and  2.4  that,  when  stochastic  adaptation  is  employed  at  each  transition, 
even  if  h(  u)  is  the  unique  stationary  distribution  for  each  adaptive  transition  kernel,  this  does 
not  assure  that  the  resultant  MCMC  algorithm  converges  or,  if  it  does,  that  it  converges  to  h. 
Analytic  examination  of  non-stationary  Markov  chains  with  stochastic  choice  of  transition  kernel 
is  generally  infeasible.  However,  we  can  give  a  simple  example  due  to  G.  0.  Roberts  (personal 
communication)  where  the  additional  randomness  introduced  in  the  selection  of  the  transition 
kernel  changes  the  overall  ergodic  behavior  of  the  chain.  Assume  u  is  two  dimensional  with  /(u)  = 
h(u)  =  HVAT(0,0, 1,1,  p),  p  >  0.  Suppose  P\  is  the  transition  kernel  associated  with  the  Gibbs 
sampler,  i.e.,  we  draw  ~  N(pt 4t-1\  1  -  p2)  and  ~  W {pt^\  1  -  p2).  Then  /  is  the 

invariant  distribution  associated  with  Pi.  Suppose  P3  is  the  transition  kernel  corresponding  to  exact 
sampling  of  u  through  the  principal  components  transformation  v\  =  itx  +tta,  uj  =  «x  —  uj.  That 
is  vi  and  v3  are  independent  with  Vi  ~  N(0,2(l4-p)),V2  ~  1V(0, 2(1 -/>)).  Inverting  (v)T  =  (vi,va)T 
to  solve  for  (u)T  =  (vx,  1*3 )T  yields  a  draw  from  /. 

Now  suppose  at  iteration  t  we  take  Pi  if  u^-1^  >  0  ,P3  otherwise.  A  trajectory  from  this  chain 
will  tend  to  show  more  positive  than  negative.  This  is  dear  since  if  <  0  we  have  a  0.5 
chance  that  >  0  while  if  >  0  simple  calculation  shows  that  ~  N(p2i^~l\  1— p*) 

whence  has  a  chance  greater  than  0.5  of  being  positive.  Hence  the  ergodic  average  of  the 
will  be  positive  and  the  ergodic  estimate  of  P(ui  >  0)  >  |.  Thus  ux  can  not  have  N( C,  1)  as 
its  stationary  distribution.  In  fact,  by  simulation  investigation  using  1000  parallel  strings  each 
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starting  from  ~  N(0, 1),  the  supposed  stationary  distribution,  we  can  observe  the  marginal 
distribution  of  at  various  t’s  as  well  as  the  behavior  of  ergodic  estimates  under  this  scheme. 
Table  4  illustrates  such  calculations  for  p  =  0.9. 


{  Insert  Table  4  here  } 
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Table  1.  Simulation  for  Resampling 


h 

Normal 

Normal 

Mixture 

2) 

N(  2,9) 

1V(2,0.09) 

mi) 

J 

0.7626 

1.3010 

0.2485 

p(4»1})  <  J) 

mm 

iRKSflii 

BSB 

0.11 

1.0658 

0.1542 

Var(jW) 

0.0012 

0.00036 

0.00056 

0.1075 

1.0661 

0.1543 
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Table  2.  Simulation  for  switching  P 


h 

Normal 

Mixture 

jto 

1.6335 

1.3260 

P(J{t)(Xu)  <  J(1)) 

min 

1.0 

f  E(JW( En)) 

1.0251 

1.1014 

Var(jW(Eu)) 

0.0190 

0.0016 

Med(jW(Eu)) 

1.0056 

1.0976 

Table  3.  Parameter  Summary  for  the  Michaelis-Menten  model 


Parameter 

Estimate 

Standard 

Error 

t  Ratio 

Correlation 

Matrix 

0i 

70096 

16443 

4.3 

mwjm 

02 

139.4 

39.3 

3.6 

1.00 

1.00 

03 

0.01144 

0.00404 

2.8 

0.99 

0.99  1.00 

* 

-5381 

1915 

-2.8 

-0.69 

-0.66  -0.66  1.00 

Table  4.  Simulation  for  the  pathological  example  p  =  0.9 


EEtfl 

ES8H31 

HESfil 

5 

0.5957 

0.767 

0.4840 

0.7232 

10 

0.5953 

0.777 

0.5268 

0.7390 

50 

0.6018 

0.750 

0.5710 

0.7516 

Hj£g£iiBSi 

0.6047 

0.784 

0.5796 

0.7534 

500 

0.5553 

0.744 

0.5895 

0.7572 

0.6093 

0.777 

0.5938 

0.7589 
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Figure  1:  Estimated  Posteriors  for  the  Nonlinear  Model  of  Section  5.1 
—  =  "true",  —  =  no  resampling,  ...  =  one  resampling. 
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